# Loading packages
library(ggplot2)
library(knitr)
library(dplyr)
library(gridExtra)
library(ggExtra)
library(psych)
library(Simpsons)
library(memisc)
White wine is a wine whose color can be straw-yellow, yellow-green, or yellow-gold. It is produced by the alcoholic fermentation of the non-colored pulp of grapes, which may have a skin of any color.
The dataset is related to white variants of the Portuguese “Vinho Verde” wine.
The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
I will be analyzing this dataset to get an idea of how the quality of wine is affected by each of the variables and by combinations of these variables.
I would also be interested to see how some of these features affect each other.
Let’s take a look at what type of variables we have here:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
All variables are numeric, and even quality is an integer. I think some categorical variables could help, so I will create three:
A factor duplicate of quality
A factor with four levels for quality rate:
quality <= 10, ‘Excellent’
quality <= 8, ‘Very Good’
quality <= 6, ‘Fair’
quality <= 4, ‘Very Bad’
A factor with four levels for sweetness, based on residual.sugar:
0-4 : ‘Dry’
4-12 : ‘Medium dry’
12-45: ‘Medium’
45-66: ‘Sweet’
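The three categorical variables could be built roughly as follows. This is a minimal base-R sketch and not necessarily the code used for this report: the data frame name `wine` and its five rows are made up for illustration, and the handling of values that fall exactly on a boundary (for example, residual sugar of exactly 4 counted as ‘Dry’) is my assumption. Only the column names `q.fact`, `rate` and `sweetness` come from the summary output.

```r
# Toy data frame standing in for the wine data (illustrative values only).
wine <- data.frame(
  quality        = c(3, 5, 6, 7, 9),
  residual.sugar = c(1.2, 4.0, 8.5, 20.7, 65.8)
)

# Factor duplicate of quality.
wine$q.fact <- factor(wine$quality)

# Quality rate with right-closed bins: <= 4, (4, 6], (6, 8], (8, 10].
wine$rate <- cut(wine$quality,
                 breaks = c(-Inf, 4, 6, 8, 10),
                 labels = c("Very Bad", "Fair", "Very Good", "Excellent"))

# Sweetness from residual sugar: [0, 4], (4, 12], (12, 45], (45, 66].
# ASSUMPTION: boundary values belong to the lower category.
wine$sweetness <- cut(wine$residual.sugar,
                      breaks = c(0, 4, 12, 45, 66),
                      labels = c("Dry", "Medium dry", "Medium", "Sweet"),
                      include.lowest = TRUE)

wine
```

Using `cut()` keeps the levels ordered from lowest to highest, which is convenient when the factors are later used to group or color plots.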
Now let’s take a look at the five-number summary of every variable:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality rate q.fact
## Min. : 8.00 Min. :3.000 Very Bad : 183 3: 20
## 1st Qu.: 9.50 1st Qu.:5.000 Fair :3655 4: 163
## Median :10.40 Median :6.000 Very Good:1055 5:1457
## Mean :10.51 Mean :5.878 Excellent: 5 6:2198
## 3rd Qu.:11.40 3rd Qu.:6.000 7: 880
## Max. :14.20 Max. :9.000 8: 175
## 9: 5
## sweetness
## Dry :2097
## Medium dry:1975
## Medium : 825
## Sweet : 1
##
##
##
The median quality is 6, while the mean quality is about 5.88.
Seeing the distributions of the variables in plots will give an even clearer idea.
fixed.acidity looks almost normally distributed (after removing the highest outlier); the density line peaks at about 6.45.
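To make "the peak of the density line" concrete, here is a small base-R sketch of how such a peak could be located. The data are simulated and are not the wine data, and `density_peak` is a hypothetical helper name I introduce for illustration.

```r
# Simulated stand-in for a roughly normal variable (NOT the wine data).
set.seed(1)
x <- rnorm(500, mean = 6.8, sd = 0.8)

# Drop the single highest value, mirroring "after removing the highest outlier".
x_trim <- x[x < max(x)]

# Hypothetical helper: location of the maximum of the kernel density estimate.
density_peak <- function(v) {
  d <- density(v, na.rm = TRUE)
  d$x[which.max(d$y)]
}

peak <- density_peak(x_trim)
peak
```

The exact peak depends on the smoothing bandwidth chosen by `density()`, so a value read off a plot is only approximate.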
I’m very interested to look at the other variables in the dataset.
volatile.acidity is the amount of acetic acid in wine (levels that are too high can lead to an unpleasant, vinegar taste).
The distribution looks slightly right-skewed, with the peak at 0.25.
Small quantities of citric.acid can add ‘freshness’ and flavor to wines. The mode in this dataset is 0.3. There is a second peak at 0.5 that might affect the taste and quality of the wine.
Residual sugar is the amount of sugar remaining after fermentation ends. Based on that, we can tell whether the wine is sweet or dry. The plot is right-skewed, with peaks at 1 g and 2 g, that is, at low residual sugar.
I would like to see how the distribution looks by sweetness level:
This is interesting: most of the wine in the dataset is “Dry” or “Medium dry”, and only a very small amount is “Sweet”.
chlorides is the amount of salt in wine. The highest frequency here is at around 0.04; the highest value in the plot is 0.20. Too much salt will put the quality at risk.
free.sulfur.dioxide has a beautiful, almost normal distribution, with a very noticeable peak at 26. It prevents microbial growth and the oxidation of wine.
total.sulfur.dioxide looks normally distributed to me, with the peak at around 125.
density depends on the percentage of alcohol and the sugar content. There are a few high-frequency peaks in addition to the mode.
pH has a beautiful normal distribution, without even removing any outliers.
It’s interesting that pH levels range from 2.7 to 3.8, which is characteristic of white wine.
sulphates has a strange distribution. There are some low counts at around 0.41 and 0.54 that cause multiple peaks in the distribution; the highest is at 0.47.
The alcohol level has an interesting distribution. There are no outliers. The dominant value is around 9.1, and there are many other high-frequency values. With all of these peaks, it almost looks like a rectangle with a piece missing.
After looking at all the input variables, I’m curious to see what the distribution of quality (0-9) and the distribution per rate look like:
Very interesting: the highest frequency for quality is at 6, and 5 is next. Looking at ‘rate’ shows that the dominant rate in the dataset is ‘Fair’, then ‘Very Good’.
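The same picture can be read as numbers. The sketch below rebuilds the quality vector from the per-score counts printed in the summary (the `q.fact` column) and tabulates it; the variable names are mine, and the `cut()` call is only my reading of the four rate levels described earlier.

```r
# Counts per quality score, copied from the q.fact column of the summary.
counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
            `7` = 880, `8` = 175, `9` = 5)

# Rebuild the quality vector from the counts.
quality <- rep(as.integer(names(counts)), times = counts)

# Share of each quality score, in percent.
quality_share <- round(100 * prop.table(table(quality)), 1)

# Four-level rate with cut points at 4, 6, 8 and 10.
rate <- cut(quality, breaks = c(-Inf, 4, 6, 8, 10),
            labels = c("Very Bad", "Fair", "Very Good", "Excellent"))
rate_counts <- table(rate)
rate_share  <- round(100 * prop.table(rate_counts), 1)

quality_share
rate_counts
rate_share
```

The rebuilt vector reproduces the ‘rate’ counts in the summary (183, 3655, 1055 and 5), so roughly three quarters of the wines are ‘Fair’.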
The dataset contains only physicochemical (input) and sensory (output) variables, with 4898 observations.
It originally came with 12 variables. I added two categorical variables to give a sense of the data. I also added a factor duplicate of quality.
The main focus is to see which of the individual features affect quality most.
I believe alcohol levels have an impact on the taste and quality of wine.
The levels of pH, citric acid and the other acidic properties have a big impact on the taste and therefore the quality of wine.
I am very interested to see how the sweetness (residual sugar) contributes to the quality rating. I would think it’s a matter of taste and differs per person, but I’m interested to take a look.
From a first glance at the individual variables, we can see that most of the wine samples are of ‘Fair’ quality (5-6); the second most common quality is ‘Very Good’ (7-8).
First, I need an understanding of all the pairwise relationships between the quantitative variables.
I think a correlation table will also help:
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
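A table like this one can be produced with `cor()`. The sketch below is only an illustration on simulated data (three made-up columns, not the wine data), and it also shows one way to rank the variables by the strength of their correlation with quality.

```r
# Simulated toy data (NOT the wine data); column names mimic the dataset.
set.seed(42)
n <- 200
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.0 - 0.002 * alcohol + rnorm(n, mean = 0, sd = 0.001)
quality <- round(2.5 + 0.3 * alcohol + rnorm(n, mean = 0, sd = 0.8))
toy <- data.frame(alcohol, density, quality)

# Pearson correlation matrix of all numeric columns.
cor_mat <- cor(toy[sapply(toy, is.numeric)], method = "pearson")
round(cor_mat, 2)

# Correlation of every other variable with quality, strongest first.
q_cor <- cor_mat[, "quality"]
q_cor <- q_cor[names(q_cor) != "quality"]
q_cor[order(abs(q_cor), decreasing = TRUE)]
```

Sorting by absolute value puts strong negative correlations (such as density in the real table) next to strong positive ones, which makes the table easier to scan.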
The plot and table are interesting; they confirm some of my thoughts about the relationships between the variables.
There is a strong negative correlation between alcohol and density (-0.78) and a moderate one between alcohol and residual sugar (-0.45), which makes sense since density and residual sugar are strongly correlated (0.84).
pH has weak positive correlations with sulphates (0.16), alcohol (0.12) and quality (0.10); its correlation with total.sulfur.dioxide is essentially zero (0.002).
quality is positively correlated with alcohol (0.44), pH (0.10) and sulphates (0.05), and very slightly with free.sulfur.dioxide (0.008).
chlorides (-0.21), residual.sugar (-0.10) and citric.acid (-0.009) have negative correlations with quality.
I think putting this in individual plots will help to understand it better:
The positive correlation between alcohol and quality is obvious in this plot. Yay to more alcohol in wine!
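One way to back up what the plot shows is to summarize alcohol within each quality score. This is a minimal sketch on six made-up observations (not the wine data); the real analysis would use the full columns.

```r
# Made-up observations for illustration only (NOT the wine data).
quality <- c(5, 5, 6, 6, 7, 7)
alcohol <- c(9.0, 9.4, 10.0, 10.6, 11.2, 12.0)

# Median alcohol within each quality score.
median_alcohol <- tapply(alcohol, quality, median)
median_alcohol

# Pearson correlation between the two columns.
r <- cor(alcohol, quality)
r
```

If the medians rise from one quality score to the next, as they do in this toy example, the positive correlation is visible without a plot.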
Surprisingly, citric acid doesn’t seem to have much effect on quality. It actually has a very weak negative correlation of -0.009, which is unexpected for me. Maybe it’s because acidity makes the wine taste lighter and less rich and round.
The quality of wine has a negative correlation with the amount of residual sugar in it.
Quality decreases as density increases, which makes sense, since density and residual sugar are strongly correlated.
Not surprisingly, there is a negative correlation between the quality of wine and the level of salt in it (chlorides).
There is a negative correlation between volatile acidity and quality of wine. This makes sense, as high amounts result in an unpleasant vinegar taste and so decrease the quality of wine.
There is a slight increase in free.sulfur.dioxide with the increase in quality.
sulphates is positively correlated with quality. This also makes sense, since sulphates act as an antioxidant.
pH has a positive correlation with the quality of wine, which is consistent with the negative correlation between quality and citric.acid.
I would like to see some of the relationships between the variables.
Since alcohol has the strongest correlation with quality, I’m curious to see what relationships it has with some of the other variables:
alcohol has a moderate negative correlation with residual sugar (-0.45), and weak negative correlations with fixed.acidity (-0.12) and citric.acid (-0.08).
The relationship of volatile.acidity with alcohol is surprising. I thought it would decrease as alcohol increases, considering its negative relation with quality, but it shows a slight increase.
Let’s see how pH behaves with other variables:
This is great and supports our earlier findings. pH has a negative correlation with citric.acid and density and a positive relation with alcohol; its correlation with free.sulfur.dioxide is essentially zero (-0.0006).
The four variables with positive correlation with quality (weak to strong) are free.sulfur.dioxide (0.008), sulphates (0.054), pH (0.099) and alcohol (0.436).
alcohol has the strongest positive correlation with the quality of wine. Second is pH, followed by sulphates.
density and residual sugar are strongly correlated. So are free.sulfur.dioxide and total.sulfur.dioxide.
quality is very weakly negatively correlated with citric.acid. This was surprising for me.
alcohol and pH are positively correlated, and both have negative relations with residual sugar and citric.acid.
I think it makes sense to start by looking at alcohol and sulphates, two of the variables positively related to quality:
It looks like combining a higher amount of alcohol with a higher amount of sulphates increases the quality of wine.
What about the combination of pH and alcohol with quality?
A high level of pH with more alcohol also increases the quality of wine.
I believe pH and sulphates should have the same relationship. Let’s check:
As expected, a high amount of sulphates combined with a high pH increases the quality of wine.
The relationship of citric.acid with quality was a surprise for me, but we confirmed it has the same relationship with alcohol. I’m curious to see the effect of alcohol with volatile.acidity on quality.
The more alcohol in wine and the less volatile.acidity, the better the quality.
How about we try with pH:
The same: quality increases with low volatile.acidity and higher pH.
Hmm, I wonder if replacing pH with fixed.acidity will change that:
Not really; there is no visible effect on quality here.
free.sulfur.dioxide has a weak positive correlation with quality. Let’s add pH to the combination:
Higher free.sulfur.dioxide and higher alcohol mean higher wine quality.
I would like to see how some relationships look when adding the sweetness categories:
Low density and low sweetness with high alcohol levels make a better quality wine.
Sweet wine with low alcohol and low pH results in bad quality.
Dry and medium dry wine, especially with a higher alcohol level and higher pH, has better quality.
Same as before, sweet wine with low alcohol and low sulphates is of bad quality.
Dry and medium dry wine with a higher alcohol level and higher sulphates has better quality.
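Statements like these can also be checked numerically by averaging quality within groups. The sketch below uses eight made-up rows (not the wine data) and a hypothetical `alcohol_level` column that splits alcohol at its median; both are assumptions for illustration only.

```r
# Made-up rows for illustration only (NOT the wine data).
toy <- data.frame(
  sweetness = c("Dry", "Dry", "Dry", "Dry",
                "Medium", "Medium", "Medium", "Medium"),
  alcohol   = c(9.0, 9.5, 12.0, 12.5, 9.0, 9.5, 12.0, 12.5),
  quality   = c(5, 5, 7, 6, 5, 4, 6, 6)
)

# Hypothetical grouping: "high" if alcohol is above the median, else "low".
toy$alcohol_level <- ifelse(toy$alcohol > median(toy$alcohol), "high", "low")

# Mean quality for every sweetness / alcohol_level combination.
cell_means <- aggregate(quality ~ sweetness + alcohol_level,
                        data = toy, FUN = mean)
cell_means
```

A table of cell means makes it easy to see whether the effect of alcohol on quality holds within each sweetness category or only overall.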
Let’s fit a linear model with the features that have the strongest correlation with quality.
I’ll create a simple linear model for quality with its strongest correlate, alcohol:
##
## Call:
## lm(formula = quality ~ alcohol, data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5317 -0.5286 0.0012 0.4996 3.1579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.582009 0.098008 26.34 <2e-16 ***
## alcohol 0.313469 0.009258 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1896
## F-statistic: 1146 on 1 and 4896 DF, p-value: < 2.2e-16
For every one-unit increase in alcohol (one percentage point of alcohol by volume), the predicted quality increases by 0.31.
The standard error of the alcohol coefficient is very small (0.009), so the slope is estimated precisely.
The p-value is highly significant, which allows us to conclude that there is a relationship between alcohol and quality.
The residual standard error is 0.7973; this is the variation in quality that does not follow a change in alcohol.
The actual quality typically deviates from the regression line by about 0.797 points. Relative to the mean quality in the dataset (5.878), the percentage error (how far off a typical prediction still is) is about 13.6%.
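To show how the numbers in the model summary are read, here is a small sketch that plugs the printed coefficients into the regression equation. All constants are copied from the output of `lm(quality ~ alcohol, data = newdata)` and from the quality column of the earlier summary; `predict_quality` is a hypothetical helper name.

```r
# Values copied from the model summary and the data summary.
intercept    <- 2.582009   # (Intercept) estimate
slope        <- 0.313469   # alcohol estimate
rse          <- 0.7973     # residual standard error
mean_quality <- 5.878      # mean of quality

# Hypothetical helper: predicted quality for a given alcohol level.
predict_quality <- function(alcohol) intercept + slope * alcohol

# One more unit of alcohol changes the prediction by exactly the slope.
step <- predict_quality(11) - predict_quality(10)
step

# Typical prediction error relative to the mean quality, in percent.
relative_error <- 100 * rse / mean_quality
round(relative_error, 1)
```

The intercept is the predicted quality at zero alcohol, which lies outside the observed range (8 to 14.2), so it is the mean quality, not the intercept, that gives a sensible baseline for the relative error.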
It would be a good idea to add more variables to the linear model:
##
## Call:
## lm(formula = quality ~ alcohol + sulphates + pH + free.sulfur.dioxide,
## data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5803 -0.5167 -0.0190 0.4829 3.2150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3030448 0.2513805 5.184 2.26e-07 ***
## alcohol 0.3327254 0.0095308 34.911 < 2e-16 ***
## sulphates 0.3803080 0.1001245 3.798 0.000147 ***
## pH 0.2094864 0.0761404 2.751 0.005957 **
## free.sulfur.dioxide 0.0062963 0.0006853 9.188 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7882 on 4893 degrees of freedom
## Multiple R-squared: 0.2085, Adjusted R-squared: 0.2079
## F-statistic: 322.3 on 4 and 4893 DF, p-value: < 2.2e-16
The p-values are all significant, which allows us to conclude that quality is related to each of the variables in the model.
The residual standard error is 0.788; this is the variation in quality that the four predictors do not explain.
The actual quality typically deviates from the model’s prediction by about 0.7882 points. Relative to the mean quality in the dataset (5.878), the percentage error (how far off a typical prediction still is) is about 13.4%.
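As a worked example of the multiple model, the sketch below applies the printed coefficients to the first wine shown in the structure output (alcohol 8.8, sulphates 0.45, pH 3, free.sulfur.dioxide 45, quality 6). The values are taken as displayed there, so they may be rounded; the variable names are mine.

```r
# Coefficients copied from the multiple-regression summary.
coefs <- c(intercept = 1.3030448, alcohol = 0.3327254, sulphates = 0.3803080,
           pH = 0.2094864, free.sulfur.dioxide = 0.0062963)

# First wine as displayed in the str() output (observed quality: 6).
wine1 <- c(alcohol = 8.8, sulphates = 0.45, pH = 3.0, free.sulfur.dioxide = 45)

# Prediction = intercept + sum of coefficient * value for each predictor.
predicted <- coefs[["intercept"]] + sum(coefs[names(wine1)] * wine1)
residual  <- 6 - predicted

predicted
residual
```

For this wine the prediction is a little below the observed score, and the gap is smaller than the residual standard error of 0.7882, so it is a fairly typical miss for this model.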
The strongest positive relationships with quality are for alcohol, sulphates and pH, and for each pair of them combined. The higher the amount of these, the better the quality of wine.
The relationship of volatile.acidity with alcohol is surprising. I thought it would decrease as alcohol increases, considering its negative relation with quality, but it shows a slight increase.
I created a simple linear model for alcohol vs. quality. I also created a multiple linear model for the features with a positive correlation with quality.
The results confirm the relationship between the variables and quality.
It’s important to start by taking a look at the distribution of quality in the dataset. Each entry is the median of at least three expert ratings. Most of the wine falls into the Fair category (5-6); the second most common rate is Very Good (7-8).
It’s good to keep this in mind: no matter the correlation with other variables, very few wines reach the Excellent level.
Alcohol has the strongest positive relationship with quality. It means that better quality wine has a higher percentage of alcohol. Although most of our dataset is of Fair or Very Good quality, as per our linear regression, alcohol explains around 19% of the variance in wine quality.
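The figure of around 19% is the R-squared of the simple model. For a linear model with a single predictor and an intercept, R-squared equals the squared Pearson correlation, which can be checked against the two outputs above. The constants are copied from the correlation table and the model summary; the variable names are mine.

```r
# Values copied from the correlation table and the simple model summary.
r_alcohol_quality  <- 0.435574715  # cor(alcohol, quality)
r_squared_reported <- 0.1897       # Multiple R-squared of quality ~ alcohol

# In simple linear regression, R-squared is the squared correlation.
r_squared <- r_alcohol_quality^2
round(r_squared, 4)

# Share of the variance in quality that alcohol alone leaves unexplained.
unexplained <- 1 - r_squared
round(100 * unexplained, 1)
```

So alcohol alone leaves roughly 81% of the variance in quality unexplained, which is worth keeping in mind when reading the plots.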
Alcohol has the strongest positive relation with quality, and sulphates are also positively related to it. The quality increases with an increase in either or both of them, so it’s worth looking at the relation of these three together with the sweetness of wine.
The quality increases with the increase of alcohol and the increase of sulphates, while the sweetness decreases.
Dry and medium dry wine with a higher alcohol level and higher sulphates has better quality.
The dataset contains only physicochemical and sensory data, with 4898 wine samples related to white variants of the Portuguese “Vinho Verde” wine.
At first, I wanted to get a sense of the data, so I started by checking the structure and summary of the data and creating plots to understand the distribution of each individual variable.
I made some plots to explore the visuals further, and ended up exploring relationships between the quality of wine and many of the other variables. This helped me identify the strongest relationships in the dataset, which I explored further with multi-variable plots.
I found out that the four variables with positive correlation with quality are free.sulfur.dioxide, sulphates, pH and alcohol.
I was curious to see the effect of other variables on alcohol. While I saw some expected results, like the negative correlations with residual.sugar, citric.acid and fixed.acidity, I was surprised by the positive relationship with volatile.acidity. Adding quality to the combination turned the relation into a negative one, as expected, since quality has its strongest positive correlation with alcohol and a negative one with volatile.acidity.
I also looked at the relationship of pH with some of the variables; the results were as expected.
I also created a simple linear model for quality with its strongest correlate, alcohol. This helped me figure out the contribution of alcohol to the change in quality. I also created a multiple linear model for the main variables that positively affect quality. This confirmed the relationships and showed their contribution to the change in quality.
One limitation is that I don’t think this is a complete dataset, as most of the wines are of Fair or Very Good quality (ratings between 5 and 8). I think that with a larger dataset, more experts to rate and a greater variety of wines, the dataset would be more interesting to analyse, especially if the individual ratings were included in the dataset rather than only their median. It would also help if the criteria for the rating were provided.